Here, we present the complete genome of the extreme thermophile, Dictyoglomus thermophilum H-6-12 (phylum Dictyoglomi), 
which consists of 1,959,987 bp. 
Dictyoglomus thermophilum is an extremely thermophilic, 
chemo-organotrophic, cellulolytic, obligate anaerobe origi- 
nally isolated from a hot spring in Japan (1). The cells are rod- 
shaped, non-spore-forming, nonmotile, and form unusual spher- 
ical bodies of up to 100 cells. D. thermophilum and Dictyoglomus 
turgidum comprise the only two species in the Dictyoglomi phy- 
lum. The genome of D. turgidum has been sequenced, and the 
strain is unable to utilize cellulose (2). D. thermophilum was se- 
lected in 2002 as part of a National Science Foundation-funded 
"Assembling the Tree of Life" project at The Institute for Genomic 
Research (TIGR) to sequence the genomes of representatives of 
the seven phyla of bacteria that at the time had cultured represen- 
tatives but no available genome sequence. 

D. thermophilum was obtained from the ATCC and grown, and its 
DNA was extracted using standard techniques. Sanger sequencing 
and genome assembly were performed as previously described for 
genomes sequenced by TIGR (3-5). Small- and large-insert plas- 
mid libraries were constructed in pUC-derived vectors after ran- 
dom mechanical shearing (nebulization) of genomic DNA. Se- 
quencing resulted in 23,127 reads, with an average read length of 
790 base pairs. The sequences were assembled using the Celera 
Assembler (6). The coverage criteria were that every position re- 
quired at least double-clone coverage (or sequence from a PCR 
product amplified from genomic DNA) and either sequence from 
both strands or two different sequencing chemistries. The se- 
quence was edited manually, and additional PCR and sequencing 
reactions were done to close gaps, improve coverage, and resolve 
sequence ambiguities (7). All repeated DNA regions were verified 
by PCR amplification across the repeat and sequencing of the 
product. The final assembly contains 1,959,987 bp, with a G+C 
content of 34% and an estimated coverage of ~9×. 
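
The coverage figure can be sanity-checked from the read statistics given above. The following is a minimal sketch, assuming coverage is taken as total sequenced bases divided by assembly length; the function name is illustrative, and the calculation ignores read trimming and any reads left out of the assembly.

```python
def estimated_coverage(n_reads: int, mean_read_len: float, genome_len: int) -> float:
    """Average depth = total sequenced bases / genome length (raw estimate)."""
    return n_reads * mean_read_len / genome_len


# Values reported in the text: 23,127 reads, 790 bp average, 1,959,987 bp assembly.
coverage = estimated_coverage(23_127, 790, 1_959_987)
print(f"Estimated coverage: {coverage:.1f}x")
```

This raw estimate comes out slightly above 9, consistent with the approximate value reported.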

The replication origin was determined by the colocalization of 
genes (dnaA, dnaN, recF, and gyrA) often found near the origin in 
prokaryotic genomes and by GC nucleotide skew [(G-C)/(G+C)] 
analysis (8). Completeness of the genome was assessed using the 
PhyloSift software (9) to search for 40 highly conserved single-copy 
marker genes (10). Thirty-nine of these 40 markers were found in 
this assembly, and the missing marker (porphobilinogen deaminase) 
was found in only 80% of the original 1,000 genomes used to 
generate the markers. 
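
The GC skew statistic used for origin finding can be illustrated with a short sketch. The window size, the function names, and the use of the cumulative skew minimum as an origin indicator are assumptions made here for illustration (the cumulative form is a common variant of the analysis); the text does not specify how the skew was windowed or summarized.

```python
def gc_skew(window: str) -> float:
    """GC skew of one window: (G - C) / (G + C), or 0.0 if no G or C is present."""
    g = window.count("G")
    c = window.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def cumulative_skew_minimum(seq: str, window: int = 1000) -> int:
    """Return the coordinate at which the cumulative windowed GC skew is lowest.

    Assumption: the cumulative skew minimum is used as a rough origin indicator.
    """
    seq = seq.upper()
    total = 0.0
    best_total = float("inf")
    best_pos = 0
    for start in range(0, len(seq) - window + 1, window):
        total += gc_skew(seq[start:start + window])
        if total < best_total:
            best_total = total
            best_pos = start + window  # end coordinate of the current window
    return best_pos


# Toy sequence: C-rich first half, G-rich second half, so the skew changes sign at 50.
toy = "C" * 50 + "G" * 50
print(cumulative_skew_minimum(toy, window=10))
```

In practice such a skew-based estimate would be cross-checked against the positions of origin-associated genes such as dnaA, as described above.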

An initial set of open reading frames likely to encode proteins 
(coding sequences [CDSs]) was predicted as previously described 
(7). All predicted proteins longer than 30 amino acids were searched 
against a nonredundant protein database, as previously described (7). 
Protein membrane-spanning domains were identified by TopPred 
(11). The 5' region of each CDS was inspected to define the 
initiation codon using similarity searches, as well as the positions 
of ribosomal binding sites and transcriptional terminators. Two 
sets of hidden Markov models, Pfam version 11.0 (12) and TI- 
GRFAMs 3.0 (13), were used to determine CDS membership in 
families and superfamilies. Pfam version 11.0 hidden Markov 
models were also used, with a constraint of a minimum of two hits, 
to find repeated domains within proteins and mask them. 
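
The repeated-domain step can be sketched as follows. This is an illustration only, not the annotation pipeline's actual code: the `DomainHit` record, the family labels, and the masking character are hypothetical, and the sketch assumes that a "repeat" means the same family hitting the same protein at least twice.

```python
from collections import defaultdict
from typing import NamedTuple


class DomainHit(NamedTuple):
    """One HMM hit on a protein (hypothetical record layout, 1-based inclusive)."""
    protein_id: str
    family: str
    start: int
    end: int


def repeated_domain_hits(hits, min_hits=2):
    """Group hits by (protein, family) and keep groups with at least min_hits hits."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[(hit.protein_id, hit.family)].append(hit)
    return {
        key: sorted(group, key=lambda h: h.start)
        for key, group in grouped.items()
        if len(group) >= min_hits
    }


def mask_regions(sequence, hits, mask_char="X"):
    """Replace the residues covered by each hit with mask_char."""
    residues = list(sequence)
    for hit in hits:
        for i in range(hit.start - 1, min(hit.end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


# Hypothetical hits: family "PF_A" occurs twice on protein "p1", "PF_B" only once.
hits = [
    DomainHit("p1", "PF_A", 1, 3),
    DomainHit("p1", "PF_A", 6, 8),
    DomainHit("p1", "PF_B", 4, 5),
]
repeats = repeated_domain_hits(hits)
masked = mask_regions("ABCDEFGHIJ", repeats[("p1", "PF_A")])
print(masked)
```

Only the family with two hits is treated as a repeat here; the single-hit family is left unmasked.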

This resulted in 1,912 predicted protein-coding sequences for 
D. thermophilum H-6-12 at the time of submission to GenBank 
(2008). 

Nucleotide sequence accession numbers. The genome se- 
quence has been deposited at GenBank under the accession no. 
CP001146. The version described in this paper is version 
CP001146.1. 